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Abstract 

Long noncoding RNAs (IncRNAs) are noncoding transcripts longer than 200 nucleotides, which show evidence of pervasive 
transcription and participate in a plethora of cellular regulatory processes. Although several noncoding transcripts have 
been functionally annotated as IncRNAs within the genome, not all have been proven to fulfill the criteria for a functional 
regulator and further analyses have to be done in order to include them in a functional cohort. LncRNAs are being classified 
and reclassified in an ongoing annotation process, and the challenge is fraught with ambiguity, as newer evidences of their 
biogenesis and functional implication come into light. In our effort to understand the complexity of this still enigmatic 
biomolecule, we have developed a new database entitled "LncRBase" where we have classified and characterized IncRNAs 
in human and mouse. It is an extensive resource of human and mouse IncRNA transcripts belonging to fourteen distinct 
subtypes, with a total of 83,201 entries for mouse and 133,361 entries for human: among these, we have newly annotated 
8,507 mouse and 14,813 human non coding RNA transcripts (from UCSC and H-lnvDB 8.0) as IncRNAs. We have especially 
considered protein coding gene loci which act as hosts for non coding transcripts. LncRBase includes different IncRNA 
transcript variants of protein coding genes within LncRBase. LncRBase provides information about the genomic context of 
different IncRNA subtypes, their interaction with small non coding RNAs (ncRNAs) viz. piwi interacting RNAs (piRNAs) and 
microRNAs (miRNAs) and their mode of regulation, via association with diverse other genomic elements. Adequate 
knowledge about genomic origin and molecular features of IncRNAs is essential to understand their functional and 
behavioral complexities. Overall, LncRBase provides a thorough study on various aspects of IncRNA origin and function and 
a user-friendly interface to search for IncRNA information. LncRBase is available at http://bicresources.jcbose.ac.in/zhumur/ 
Incrbase. 
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Introduction 

Once set aside as genomic 'junk', the non coding repertoire of 
the transcriptome has steadily emerged to be functionally 
significant guiding factors in the regulation of various biological 
processes impacting cellular development, differentiation, and 
metabolism. Among these, long noncoding RNAs (IncRNAs) have 
recently become the hotspot of attention due to their remarkable 
similarity with protein coding associates: they undergo splicing and 
are most often endowed with 'poly(A) tail [1,2], a feature hitherto 
associated with protein-coding transcripts. LncRNAs are tran- 
scripts longer than a somewhat arbitrary cut-off of 200 nucleotides 
(nts) [3,4], albeit less conserved than protein coding RNAs and 
have high tissue specificity [5-7], thus initially raising the doubt of 
being 'transcriptional artifacts'. However, tiling array studies of the 
human genome point out to the fact that a large fraction of the 
transcription machinery is employed for synthesis and maintenance 
of IncRNAs [8,9] . LncRNAs contribute to a plethora of cellular 
regulatory processes, ranging from X chromosome inactivation, 
genomic imprinting and chromatin modification, to telomere 
elongation, transcriptional activation, and nuclear trafficking 
[10,11]. Parallel studies on IncRNA function and expression in 



different cellular systems have led to the accumulation of massive 
amounts of experimental results, ready to be collated into 
comprehensive, reliable catalogs of IncRNA information. 

With the advent of new technologies achieving unprecedented 
depths in RNA sequencing, several thousands of IncRNAs have 
been identified across the mammalian genome with diverse 
genomic context and mechanistic details [10,12-14]. Employing 
a combination of in silico and wet bench techniques, several 
independent and collaborative efforts have put forward an 
impressive catalogue of IncRNAs, with primary emphasis on 
human and mouse transcripts. IncRNAdb [15] has assembled a list 
of IncRNAs with referenced information about their biological 
functions and expression in different systems. The GENCODE 
consortium [16] has been an extensive resource for human 
IncRNAs till now, and has recendy announced their first catalogue 
of mouse IncRNAs. The NONCODE database (v3.0 and v4.0) 
[17,18] has also grouped together a significant number of human 
and mouse IncRNAs and provided associated information on 
IncRNA cellular localization, function and expression. Other 
recent databases hosting discrete information about different 
aspects of IncRNAs in human include LNCIpedia [19], 
IncRNome [20] and LncRNADisease [21]. Despite such extensive 
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work on IncRNAs, there remain certain domains which have not 
been well defined regarding IncRNA biology and function: one 
such area is to analyze the influence of different regulatory 
elements on the function of IncRNAs and vice versa. Little is 
known regarding the regulatory interactions between IncRNA and 
other small ncRNA classes. Recent reports have suggested that 
IncRNAs could potentially interact with other classes of ncRNAs 
and modulate their functions [22]. Further, IncRNAs can act as 
precursors for small RNAs and can regulate gene expression via 
small RNA dependent mechanism [22]. Comprehensive informa- 
tion on IncRNA association with Repeat Elements of distinct 
Repeat Families, with Imprinted gene loci, and distribution of 
CpG Islands (CGI) in IncRNA promoter regions is still lacking. 
These are crucial aspects to consider while unraveling the 
functional complexity of IncRNAs, and would help us delve deep 
into yet unexplored depths of the cellular regulome. Such 
incompleteness in existing information on IncRNAs motivated us 
to analyze these aspects of IncRNA and develop LncRBase. Here 
we have extensively categorized human and mouse IncRNAs and 
also featured non coding transcript variants of protein coding 
genes, like retained introns, processed transcripts, and ambiguous 
ORF containing non coding transcript variants, obtained from 
Ensembl (Gene 75). 

LncRBase is a comprehensive and user friendly database, with a 
total of 216,562 transcript entries. The database hosts information 
on basic IncRNA transcript features, with additional details on 
their genomic location, overlapping small ncRNAs, association 
with Imprinted genes, and association of Repeat Elements with 
each transcript. LncRNA promoters have also been classified 
based on their association with CGIs. Furthermore, a subset of 
microarray probes has been remapped to IncRNAs and has been 
associated with gene expression signatures of specific disease types. 
The database also hosts IncRNA expression data obtained from 
RNA-sequencing studies in different tissues from human and 
mouse. This would provide further insight into IncRNA function 
with respect to their expression in different tissue systems. Overall, 
LncRBase will serve as a useful resource for both computational 
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Figure 1. Diagram of the genomic context of different IncRNA 
subtypes, (a) CI: Completely Intronic IncRNA, IA: Intronic Antisense 
IncRNA and IO: Intron Overlapping IncRNA; (b) 3UO: 3/UTR Overlapping 
IncRNA, 5UO: 5/UTR Overlapping IncRNA, CD: CDS overlapping IncRNA 
and AN: Antisense IncRNA; (c) SO: Sense Overlapping IncRNA; (d) LI: 
Intergenic IncRNA; 
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and experimental biologists to browse, search and retrieve 
information on human and mouse IncRNAs. 

Results And Discussion 

Distribution of different IncRNA subtypes based on their 
genomic location 

LncRBase hosts fourteen distinct subtypes of IncRNAs including 
our newly classified types and other transcript biotypes. These are 
as follows: 

(1) 3UO, 3/UTR overlapping IncRNAs overlapping any 3/ 
UTR exon in the sense strand. (2) 5UO, 5/UTR overlapping 
IncRNAs overlapping any 5/UTR exon in the sense strand. (3) 
CDS, CDS overlapping IncRNAs overlapping any CDS exon. 
(4) LI, Intergenic (line) IncRNAs transcribed from in between 
two gene loci. (5) AN, Antisense IncRNAs intersecting any exon 
of a protein-coding locus on the opposite strand. (6) CI, 
Completely Intronic transcripts residing within introns of a 
coding gene, but do not intersect any exons. (7) IA, Intronic 
Antisense IncRNAs completely overlapping with an intron in 
the opposite strand.(8) IO, Intron Overlapping IncRNA splice 
variants of a gene, contain intronic sequence. (9) PS, Pseudogene 
transcripts having homology to protein coding transcripts but 
containing disrupted coding sequence and an active homologous 
gene can be found at another locus. (10) SO, Sense Overlapping 
IncRNAs containing a coding gene in its intron on the same 
strand. (11) AO, Ambiguous ORF transcripts believed to be 
protein coding, but with more than one possible open reading 
frame.(12) PT, Processed Transcripts not containing an ORF 
(obtained from Ensembl dataset).(13) MI, miscRNA from the 
Ensembl transcript dataset. (14) NC, Non coding transcripts not 
falling in any of the above mentioned categories. Diagrammatic 
illustration of the subtypes are given in Figure 1. 

LncRNAs are known to be gready varied in length, starting 
from the popular consensus of 200 bps upto ~9 Kb. Since 
transcript length influences their secondary structure formation 
and functional variation [23,24], hence to show the length 
distribution of human and mouse IncRNAs we have plotted the 
corresponding length distribution graph [Figure SI]. We observe 
that most IncRNAs fall within the 500-1,000 bp length window, in 
both the organisms. 

Distribution of non coding IncRNA (i.e. with no coding 
potential) subtypes follows the pattern presented in Figure 2(a- 
b). Lis are the most abundant type of IncRNAs in human followed 
by PT IncRNAs [Figure 2(a)]. In mouse, majority of IncRNAs 
fall in the LI type, followed by CI IncRNAs [Figure 2(b)]. 
Inclusion of FANTOM3 transcripts in the dataset might be the 
reason for the abundance of intron-associated transcripts in mouse 
[25]. Abundance of intergenic (LI) IncRNAs reflects the extent of 
annotation of different subtypes of IncRNAs, intergenic IncRNAs 
being extensively categorized in previous works. 

Ambiguous IncRNAs overlap with multiple genomic 
elements 

Some IncRNAs map to more than one genomic locus and have 
been grouped accordingly under two or more subtypes. These 
IncRNAs can have more than one type of association with 
different coding gene elements, like introns and exons. An example 
would include uc021ssq.l(a UCSC transcript), which is both AN 
and IO. The ID assigned to such IncRNA transcript having dual 
identity is hsaLB_AN_88674.1 and hsaLB_IO_88674.1 with a 
change only in the 'subtype' part of the identifier [AN and IO]. 
Due to such genomic context, these IncRNAs are open to multiple 
interpretations and have been grouped as 'Ambiguous IncRNAs'. 
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Figure 2. Distribution of non coding IncRNA subtypes. Distribution of noncoding IncRNA subtypes in (a) human and (b) mouse. Abbreviations: 
(1) 3UO, 3/UTR overlapping IncRNAs (2) 5UO, 5/UTR overlapping IncRNAs (3) CDS, CDS overlapping IncRNAs (4) LI, Intergenic (line) IncRNAs (5) AN, 
Antisense IncRNAs (6) CI, Completely Intronic (7) IA, Intronic Antisense IncRNAs (8) 10, Intron Overlapping IncRNA (9) PS, Pseudogene IncRNAs (10) SO, 
Sense Overlapping IncRNAs (11) AO, Ambiguous ORF transcripts (12) PT, Processed Transcripts (13) Ml, miscRNAs (14) NC, Non coding transcripts. 
doi:1 0.1 371 /journal.pone.01 0801 0.g002 



3,578 human IncRNAs (2.7% of total transcripts) and 3,731 mouse 
IncRNAs (4.7% of total transcripts) show such ambiguous 
behaviour. These IncRNAs can be separately searched in the 
database and have also been provided in Data SI. 

We have also considered non coding transcript variants of 
protein coding genes and included them in our list of IncRNAs. 
47,598 and 23,124 newly predicted human and mouse IncRNAs 
respectively have been obtained from the retained intron (included 
in our Intron Overlapping subtype), ambiguous ORF and 
processed transcript variants of protein coding genes in Ensembl 
(Gene 7 5). A detailed list of such transcripts and corresponding 
genes can be found in Data S2. 

Coding potential of IncRNAs 

It has been stated that certain genes have bifaceted transcript 
outputs that participate in distinct spectrums of gene regulatory 
interactions. A well known example would be SRA1, which 



shows bifunctionality, both as an RNA regulator and a 
functional protein encoder [26], mediated by alternative splice 
variants. The protein-coding longer SRA1 isoforms include the 
same core sequence as needed for the regulatory IncRNA 
function which is thus concluded to be bi-functional. This 
necessitates careful examination of the coding capability of a 
putative non-coding transcript before it could satisfy such 
paradigm of bifunctionality. We have calculated the coding 
potential of each transcript using Coding-Potential Assessment 
Tool (CPAT) [27]. In order to imply a selective inclusion 
criteria for specifying an IncRNA, we have separated transcripts 
showing a positive CPAT coding probabilty score and grouped 
them as 'putatively coding' transcripts which would warrant 
further stringent investigations. Human transcripts with Coding 
Probability (CP) score <0.364 were declared noncoding and 
those with CP> = 0.364 were declared putatively coding. CP 
threshold used for mouse was 0.44 (CP<0.44 was non coding 




Figure 3. Distribution of putatively coding IncRNA subtypes. Distribution of putatively coding IncRNA subtypes in (a) human and (b) mouse. 
Abbreviations: (1) 3UO, 3/UTR overlapping IncRNAs (2) 5UO, 5/UTR overlapping IncRNAs (3) CDS, CDS overlapping IncRNAs (4) LI, Intergenic (line) 
IncRNAs (5) AN, Antisense IncRNAs (6) CI, Completely Intronic (7) IA, Intronic Antisense IncRNAs (8) IO, Intron Overlapping IncRNA (9) PS, Pseudogene 
IncRNAs (10) SO, Sense Overlapping IncRNAs (11) AO, Ambiguous ORF transcripts (12) PT, Processed Transcripts (13) Ml, miscRNAs (14) NC, Non 
coding transcripts. 

doi:1 0.1 371 /journal.pone.01 0801 0.g003 
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Figure 4. Distribution of Repeat Elements across IncRNAs. Distribution of different Repeat Families across IncRNAs in (a) human and (b) 
mouse compared to protein coding transcripts, (c) Repeat associated IncRNA subtypes in human and mouse. Abbreviations: (1) 3UO, 3/UTR 
overlapping IncRNAs (2) 5U0, 5/UTR overlapping IncRNAs (3) CDS, CDS overlapping IncRNAs (4) LI, Intergenic (line) IncRNAs (5) AN, Antisense IncRNAs 
(6) CI, Completely Intronic (7) IA, Intronic Antisense IncRNAs (8) IO, Intron Overlapping IncRNA (9) PS, Pseudogene IncRNAs (10) SO, Sense 
Overlapping IncRNAs (11) AO, Ambiguous ORF transcripts (12) PT, Processed Transcripts (13) Ml, miscRNAs (14) NC, Non coding transcripts. 
doi:1 0.1 371 /journal.pone.01 0801 0.g004 



and CP> = 0.44 was putatively coding). CP threshold values 
considered for calculating non coding and putatively coding 
transcripts were as per CPAT documentation. Since many 
IncRNAs reported so far have not yet undergone experimental 
validation regarding their protein coding capabilities, this will 
serve as a reference score for users to select and analyze specific 
IncRNAs based on their coding potential. 

Figure 3(a— b) demonstrates subtype wise distribution of 
putatively coding IncRNAs in human and mouse. It is logical to 
assume that some of these transcripts might exhibit a bifunctional 
mode of operation, with dual role as a regulator and a messenger, 
since some IncRNAs have been reported to have such a functional 
dichotomy [28,29]. However, since recent investigations have 
redefined the coding capacity of previously annotated IncRNA 
transcripts with stringent, high-confidence annotation protocols 
[30], these putatively coding IncRNA transcripts would serve as a 
preliminary dataset for sorting out potential transcripts with 
proposed dual mode of function, which have been wrongly tagged 
with the 'ncRNA' moniker. 

Association of IncRNAs with small ncRNAs 

LncRNAs interact in a well-regulated and orchestrated 
manner with various biomolecules to participate in a multi- 
layered integrated regulatory circuitry. It has been in reports 
recently that IncRNAs harbour small ncRNAs [8]. miRNAs are 
a class of small ncRNAs which are ~22 nt short endogenous 
RNAs that comprise of the most highly explored class of gene 
regulatory molecules in multicellular organisms [31]. They not 
only function as microregulators of protein-coding genes but also 
interact with and regulate the functions of different ncRNAs 
[32]. Large scale transcriptome analyses have pointed out several 
sites of miRNA-lncRNA interaction across the genome [33]. 
LncRNA-miRNA interaction acts as an additional strata in the 
regulatory interactome, where IncRNAs act as miRNA quench- 
ers and promote gene expression and subsequent functional 
manifestation [34]. piRNAs are small ncRNAs of 25-33 nts in 
length, that are derived from transposable elements within the 
genome. They are involved in cellular epigenetic programming 
via pairing with piRNA-complementary binding sites in the 
genome which act as guidance cues for the recruitment of 
epigenetic factors in target sites [35]. LncRNAs associated with 
germline specific piRNA clusters during male germline develop- 
ment may function to regulate gene expression via piRNA- 
mediated epigenetic mechanisms [36]. 



Based on such reports we sought to find out a positional 
preference for small RNA abundance viz. piRNA and miRNA 
abundance within certain IncRNA loci. 

miRNA associated IncRNAs: 2,624 human and 941 mouse 
IncRNAs mapped with miRNA primary transcripts in the same 
strand within human and mouse genome respectively. Subse- 
quently deepBase [37] annotated small RNA clusters (deepBase 
contains small RNA sequencing data from multiple experiments) 
were mapped to these primary miRNAs. We considered the 
IncRNA associated primary miRNA transcripts which constitutes 
the small RNA clusters. A Significance Score (for an IncRNA 
transcript j) was assigned to assess the primary miRNA transcript 
abundance within that IncRNA locus (details on Significance 
Score is provided in Materials and Methods section). 

piRNA associated IncRNAs: Human and mouse piRNAs were 
mapped to human and mouse genome respectively and piRNA 
clusters were computed following the definition of Lau el al [38] 
(discussed in Materials and Methods section). These piRNA 
clusters were mapped to IncRNA transcripts which resulted in 
1,302 and 2,547 piRNA cluster-associated IncRNA transcripts in 
human and mouse respectively. A Significance Score (for an 
IncRNA transcript fj was calculated, based on the number of 
piRNAs (constituting a particular piRNA cluster) to assess the 
piRNA abundance within that particular IncRNA locus (details on 
Significance Score is provided in Materials and Methods). 

Overall, these miRNA and piRNA associated IncRNAs can 
harbour such small ncRNAs and the Significance Score repre- 
senting the abundance of mature piRNAs or primary miRNAs 
within a particular IncRNA reveals the plausibility of biogenesis of 
these small RNAs from the IncRNA transcript. 

Associated Repeat Elements 

LncRNAs containing Alu repeats can participate in post- 
transcriptional regulation of protein coding RNAs through 
imperfect base pairing with 3/UTR Alu elements and targeting 
mRNA transcripts for Staufen-mediated decay (SMD). These 
lncRNA-3/UTR base-pairing interactions create double-strand- 
ed STAU 1 -binding sites in mRNA 3/UTRs, inducing Staufen 1 
(STAU1) binding, resulting in destabilization and degradation of 
the target mRNA [39,40]. Repeat containing IncRNAs are also 
known to take part in translational regulation of an mRNA. 
SINEB2 Repeat Element containing IncRNAs complement with 
5/UTR of mRNAs in a head- to-head fashion and upregulate 
translation. Well known examples include mouse UchllAS 
IncRNA and antisense KCS1 IncRNA in yeast [39]. A mutated 
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Figure 5. Distribution of CGI and non-CGI promoters among the various IncRNA subtypes. CGI and non-CGI promoters of each subtype 
of IncRNA (a) in human (b) in mouse. Abbreviations: (1) 3UO, 3/UTR overlapping IncRNAs (2) 5UO, 5/UTR overlapping IncRNAs (3) CDS, CDS 
overlapping IncRNAs (4) LI, Intergenic (line) IncRNAs (5) AN, Antisense IncRNAs (6) CI, Completely Intronic (7) IA, Intronic Antisense IncRNAs (8) IO, 
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Transcripts (13) Ml, miscRNAs (14) NC, Non coding transcripts. 
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LI element in an IncRNA is associated with infantile 
encephalopathy [41]. HERVH or Human endogenous retrovi- 
rus subfamily H is a class of transposable elements essential for 
both development and maintenance of pluripotency in somatic 
cells [42]. All these evidences point out the functional versatility 
of Repeat-associated IncRNAs and the importance of repetitive 
sequences in IncRNA transcripts. Several IncRNAs contain 
functional repeat sequence domains [43] and lincRNAs have 
been shown to contain a significant proportion of highly 
repetitive transposable elements (TE) [44]. Hence, we mapped 
Repeat Elements (belonging to different Repeat Families) to 
IncRNA loci and analyzed the distribution of Repeat Elements 
in individual IncRNA transcripts. Figure 4(a-b) shows the 
significance of the association of Repeat Elements with IncRNAs 
compared to that with protein coding transcripts. SINE and 
DNA Repeat Families are more abundant in IncRNAs in both 
human and mouse compared to other Repeat Families. Among 
the different IncRNA subtypes, lincRNAs (LI) are the most 
abundant class of Repeat-associated IncRNAs both in human 
and mouse [Figure 4(c)]. This is in line with previous 
observation that intergenic IncRNAs tend to be associated with 
transposable elements [44]. Processed Transcript (PT), Intron 
Overlapping (IO) and Antisense (AN) IncRNAs also show a 
significant association with Repeat Elements in both human and 
mouse. Overall, IncRNA-Repeat associations would provide 
further insight into yet another facet of genome regulation via 
noncoding mediators containing repeat domains [45]. 

Associated Imprinted genes 

Imprinted gene clusters contain one or several IncRNAs 
functioning as as-acting silencers of neighbouring protein coding 
genes [46] . LncRNAs were mapped to Imprinted genes in order to 
find out overlapping transcripts. A total of 918 and 415 lncRNA- 
imprinted gene associations were found in human and mouse 
respectively. 

Imprinted ncRNAs show different imprinting features com- 
pared to imprinted protein-coding genes, and have a greater 
participation in the mechanism of genomic imprinting. These 
imprinted ncRNAs coexist with large imprinted regions in the 
genome and act as key players in the evolution and regulation of 
genomic imprinting [47]. This provides further insight into the 



function of IncRNAs as potential regulators of imprinting and the 
expression of other genes associated with imprinted loci. 

IncRNA promoter analysis 

CpG Islands (CGIs) are generically equipped to influence 
local chromatin structure and regulate gene activity. CGI 
promoters have their own distinctive chromatin configuration 
and show specific patterns of transcription initiation [48]; 
modifications like cytosine methylation in the CpG moieties 
themselves result in stable shutdown of the associated promoter 
[48]. 45% of all human gene promoters, particularly the tissue 
specific gene promoters, do not lie within CGIs [49]. These 
non-CGI promoters are subjected to DNA methylation, 
regulating the establishment and maintenance of tissue-specific 
expression patterns [50]. LncRNA promoter regions (—1 to + 
1 kb from TSS) of each IncRNA transcript in mouse and 
human were mapped with CGIs to classify them as CGI or 
non-CGI promoters. We found 35,674 and 19,957 CGI 
IncRNA-promoters in human and mouse respectively. We have 
recorded the CGI name, %GC content, CpG density and type 
of overlap with the IncRNA promoter region. Most of these 
CGI promoters showed a high CpG content, indicative of 
distinct functional status of the IncRNA transcripts. Non-CGI 
promoters are more abundant in IncRNAs: 93,144 and 58,646 
non-CGI promoters were found in human and mouse 
respectively. Figure 5(a— b) gives a distribution of CGI and 
non-CGI promoters of each biotype of IncRNA in human and 
mouse genomes. 

Abundance of non-CGI promoters is consistent across most of 
the IncRNA subtypes in both human and mouse. Given the 
evidence regarding prevalence of non-CGI promoters in 
establishment of tissue-specific gene expression patterns, this 
feature corroborates their association with IncRNA transcripts, 
which are also prone to maintain a distinct tissue-specific 
expression profile. An exception is noted in case of exon 
overlapping (5UO, CDS, AO) IncRNA subtypes where abun- 
dance of CGI promoters point towards the possibility of 
divergent transcription from protein coding gene promoters 
[51] giving rise to non coding IncRNA isoforms. 
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Figure 6. Different web interfaces allow easy view of IncRNA information, (a) Search page provides multiple options for searching IncRNA 
information, (b) General Output page displays basic information about an IncRNA transcript and provides multiple options for probing into further 
details, (c) Detailed Output page displays complete information about an IncRNA transcript, (d) LncRBrowse allows user to browse through different 
genomic annotation tracks. 
doi:1 0.1 371 /journal.pone.01 0801 0.g006 



Microarray probe classification and disease association 

Recent studies point out that certain IncRNA sequences 
uniquely map with conventional, pre-annotated microarray 
probesets. Hence, available microarray data can be mined to 
obtain IncRNA expression profiles [52-54]. 4,631 Affymetrix 
GeneChip Human Genome U133 Plus 2.0 Array probes and 
2,707 Aflymetrix GeneChip Mouse Genome 430 2.0 Array probes 
in human and mouse respectively were remapped to IncRNA 
transcripts. From Gene Expression Barcode [55,56], we obtained 
consensus gene expression signatures of different tumor tissues in 
human and mouse. A binary version of expression values or a 
'barcode' is assigned to provide expression calls for all genes on the 
Affymetrix GeneChip Human Genome U133 Plus 2.0 Array and 
Affymetrix GeneChip Mouse Genome 430 2.0 Array. Based on 
this barcode, probes are assigned an expression call of 1 or 0 to 
denote expressed and silenced calls respectively based on their 
expression in a particular tissue type. Probes remapped to 
IncRNAs were finally matched to these probes associated with 
different tumor tissue specific consensus gene expression datasets. 
These disease-associated IncRNAs would serve as a starting 



dataset for subsequent experiments which are essential to conclude 
about the actual state of expresssion of these IncRNAs in respective 
disease systems. 

IncRNA expression in tissues 

We examined the expression patterns of human and mouse 
IncRNAs in different tissues. To this end, we downloaded RNA- 
Seq dataset comprising of 12 cell lines in human (GSE30567) and 
6 tissues in mouse (GSE30352) from Gene Expression Omnibus 
(http://www.ncbi.nlm.nih.gov/geo/) and estimated IncRNA ex- 
pression in these tissues. Spliced alignment was carried out using 
TopHat2 [57] and IncRNA transcript expression was thus 
estimated (in FPKM) using Cufflinks [58]. Analysed IncRNA 
expression data for different tissues is available in our database. 

Search and output options 

LncRBase primarily processes the user query through simple 
search options, which in turn retrieve information from the 
relational database tables, format the result and display it on the 
web interface [Figure 6(a— c)]. Six different search options are 
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provided by which a user can look for IncRNA transcript 
information which are as follows: 

1. Search by IncRNA Accession ID: User can input specific 
IncRNA Accession ID a) from LncRBase like hsaL- 
B_IO_89 175.2 for human and mmuLB_AN_52880 for mouse 
or from b) any of the source databases (Ensembl Gene75, 
UCSC ID, NONCODE v3.0, H-InvDB 8.0 or Human Body 
Map lincRNAs) to view detailed information about that 
transcript. 

2. Search by IncRNA Gene Symbol: User can input known 
IncRNA Gene Symbol to search for the number of transcript 
entries for that gene recorded in LncRBase and associated 
literature references for that gene; clicking on any LncRBase 
ID will direct the query to a webpage containing detailed 
information about that particular transcript. 

3. Search IncRNA in Disease: Selecting any particular disease will 
show the probes associated with a consensus expression profile 
for that particular disease type and the IncRNAs mapped to 
each of these probes. Further experiments in future are to be 
done to check the actual expression of these IncRNAs in the 
corresponding disease states. 

4. Browse by IncRNA subtype and coding potential: Any specific 
IncRNA subtype can be selected to view basic information 
about the corresponding IncRNAs. Given specific genomic 
locus as input, the ouput will display all IncRNAs mapped in 
that position. One can add an extra search option by which 
they can check IncRNA transcripts as per their coding 
potential. 

A separate section within the fourth search option allows to 
check for Ambiguous IncRNA transcripts which belong to 
multiple subtypes. User can give a chromosome wise search 
based on coding potential. 

5. Search for ncRNA associated IncRNAs: User can check for 
IncRNAs associated with small ncRNAs viz. primary miRNAs 
and piRNAs by a chromosome wise search in respective 
genomes. 

6. Search IncRNA expression: 'Search IncRNA expression' page 
allows the user to view expression of IncRNA transcripts 
(belonging to different subtypes), obtained from RNA-Seq data 
within different tissues in human and mouse. 



Visualizing LncRBase 

For viewing the information related to IncRNAs, a browser 
(IncRBrowse) has been integrated. This browser provides an 
integrated view of Refseq Genes, Refseq Transcripts, IncRNA 
subtypes, IncRNA Promoter, CpG Islands, Repeat Masker 3.27 
Repeats, miRBase (v20) primary miRNA transcripts and mature 
miRNAs, piRNAs and SNPs from dbSNP. Browsing through 
these tracks would allow the user to check for their association with 
respect to each other [Figure 6(d)] . Details of each annotation 
track has been provided as a pdf in Browse and Help page of 
LncRBase and also as Data S3. 

Materials and Methods 

Data procurement 

All transcript sequence information corresponds to human hgl9 
and mouse mm 10 genome assemblies respectively. The cDNA 
sequences for both mouse and human ncRNAs were obtained 
from different sources which are listed in Table 1 . Since input file 
formats were different such as fasta, bed and gff3, we have 
developed custom scripts for extraction of sequences and 
annotations, which are stored in our database. Genomic elements, 
including 5/UTR exons, 3/UTR exons, CDS exons, introns 
(RefSeq annotations), CGIs and fasta sequences of IncRNA 
promoter regions were downloaded from the University of Santa 
Cruz (UCSC) Table Browser data retrieval options. miRNA 
related information was downloaded from mirBase20 [59]. piRNA 
sequences were downloaded from National Centre for Biotech- 
nology Information (NCBI) [60] in fasta format. Imprinted genes 
and their annotations were downloaded from Geneimprint 
[http:/ /www.geneimprint.com]. For re-annotating the microarray 
probes, we downloaded Affymetrix GeneChip Human Genome 
U133 Plus 2.0 Array and Affymetrix GeneChip Mouse Genome 
430 2.0 Array probe sequences from manufacturer's website 
[http://www.affymetrix.com]. Consensus gene expression signa- 
tures for different tumor tissues were downloaded from Gene 
Expression Barcode database for different tumor tissues [55,56]. 
RNA-Seq datasets were downloaded from NCBI Gene Expression 
Omnibus (GEO) [61]. For human, we downloaded the ENCODE 
Cold Spring Harbor Lab Long RNA-seq data (GSE30567) and for 
mouse we used PolyA+ RNA-Seq data from GSE30352. For 
curating available literature on IncRNAs, we downloaded all 
PubMed evidences from NCBI using the keyword 'RNA and then 
screened out relevant records using gene name search. 



Table 1. Source of transcripts for LncRBase. 





Database 


Version 


Organism 


Number of transcripts 


Ensembl [16] 


Gene 75 


Human 


93753 






Mouse 


36069 


UCSC Genome Browser database [68] 




Human 


15321 






Mouse 


15141 (noncoding transcripts) 


NONCODE [17] 


v3 


Human 


33801 






Mouse 


36991 


Human bodymap lincRNAs [14] 




Human 


14353 


H-lnvDB [69] 


8.0 


Human 


20395 (noncoding transcripts) 


Total set of nonredundant transcripts 






128818 (human) 


78603 (mouse) 
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Figure 7. Classification scheme for LncRBase transcripts. The 

transcripts selected were either pre-annotated as IncRNAs or simply 
mentioned as non coding transcripts. In the former case, additional 
details were provided about the transcripts, while in the latter case, 
transcripts > = 200 nts were selected and characterized as IncRNAs. 
doi:1 0.1 371 /journal.pone.01 0801 0.g007 

Data processing and refinement 

A. Redundancy check and assigning Alias ID. Initial sets 
of ncRNA transcripts taken from different sources [Table 1] 
were classified into two different types [Figure 7] based on their 
existing annotation. The already annotated IncRNAs were 
grouped into 'Existing IncRNAs'. The rest were grouped as 
'Others'. This second group of ncRNA transcripts was filtered 
based on the length of the transcript sequences, with a cut off of 
& 200 nts to comply with the basic criteria of an IncRNA 
transcript. 

The resultant dataset needed further refinement for redun- 
dancy removal since the data sources mentioned above contain 
a varying degree of overlapping data. To decide whether two 
primary entries might represent the same IncRNA, we 
considered their chromosomal location and sequence similarity 
as reference. LncRNAs were aligned to each other using 
BLAST-Like Alignment Tool (BLAT) [62] with default param- 
eters. We considered alignments having block count = 1 and 
mismatch = 0. We calculated percent-overlap of the both the 
reference and the query transcript sequence. Transcripts having 
more than 99% sequence similarity and overlap in genomic 
coordinates were considered to be 'Identical'. The rest of the 
transcripts did not have any redundant entry. The 'Identical' 
transcripts were finally kept as one unique entry in LncRBase 
and assigned a unique LncRBase ID. The redundant transcripts 
were mentioned as 'Alias IDs'. 

B. Classification of IncRNAs and assignment of unique 
transcript IDs. All the IncRNA transcripts were classified and 
annotated according to their location, with reference to protein- 
coding gene elements in their respective genome. This was done 
by comparing their chromosomal coordinates to those of a 
comprehensive list of pre-annotated genomic elements, including 
5/UTR exons, 3/UTR exons, CDS exons, and introns. 

Based on the chromosomal location, we have assigned a unique 
identifier to each transcript. The IncRNA transcripts have been 
designated a unique identifier as [abbreviation of species name] 
LB_ [subtype] _ [number] . Abbreviation of human and mouse are 



'hsa' and 'mmu' respectively. LB is the abbreviation of LncRBase, 
'subtype' is the distinct subtype of an IncRNA transcript. The 
'number' is assigned to the transcript based on its positional/ 
sequential occurrence within the genome. 

E.g. :(a) hsaLB_CI_482 implies a human IncRNA transcript 
whose subtype is Completely Intronic (CI) and its sequence of 
occurrence within the human genome is 482. 

(b) mmuLB_CI_7 implies a mouse IncRNA transcript whose 
subtype is Completely Intronic (CI) and its sequence of occurrence 
within the mouse genome is 7. 

For each of the cDNA transcript variants which have the same 
chromosomal locus, the number is extended by a numerical index 
Eg: IDs hsaLB_CI_8429.1 and hsaLB_CI_8429.2 imply that there 
exist two cDNA transcript variants from the same genomic locus. 

C. Assessment of coding capacity. Standalone Coding- 
Potential Assessment Tool (CPAT) [27] was used to check the 
coding probability of the IncRNA transcripts. CPAT has high 
accuracy (0.967) and efficiency (10,000 times faster than CPC 
[63] and PhyloCSF [64]). For human, a coding probability 
threshold of 0.364 was used as cut-off. Transcripts with Coding 
Probability (CP) score <0.364 were declared non coding and 
those with CP> = 0.364 were declared putatively coding. CP 
threshold used for mouse was 0.44 (CP<0.44 was non coding 
and CP> = 0.44 was putatively coding). CP threshold values 
considered for calculating non coding and putatively coding 
transcripts were as per CPAT documentation. Briefly, nonpara- 
metric two-graph ROC curves are used to determine an optimal 
CPAT score threshold that maximizes the discriminatory power 
and a score threshold of 0.364 gave the highest sensitivity and 
specificity (0.966 for both) for human data [27]. The CPAT 
score threshold of 0.44 was calculated similarly for mouse. 

D. Mapping piRNAs and miRNAs to IncRNAs. piRNA 
associated IncRNAs: Human piRNA sequences reported by 
Girard et al [65] and mouse piRNA sequences reported by 
Girard et al [65] and Lau et al [38] were obtained from NCBI 
[60] and mapped to human and mouse genome respectively. As 
piRNAs are known to originate in the system as piRNA clusters, 
possible piRNA clusters were computed following the definition of 
Lau et al [38] (minimum piRNA density of 20 per Kbase, window 
span of 20 Kbases and window increment of 1 Kbase). LncRNA 
transcripts were mapped to these piRNA clusters. Based on the 
number of piRNAs (constituting a particular piRNA cluster) 
occurring within an IncRNA transcript locus, a Significance Score 
(for an IncRNA transcript j) was calculated to assess the piRNA 
abundance within that particular IncRNA locus as given by the 
following formula: 

Significance Score, 

Mapped piRNA count (present within each piRNA cluster) 
IncRNA transcript length 

miRNA associated IncRNAs: Primary miRNA sequences 
obtained from miRBase v20 [59] were mapped to the IncRNA 
transcripts. Subsequently deepBase [37] annotated small RNA 
clusters (this database contains small RNA sequencing data from 
multiple experiments) were mapped to primary miRNA associated 
IncRNA transcripts. Based on the number of small RNA reads 
(constituting each RNA cluster) that mapped to these IncRNA 
transcripts, the Significance Score (for an IncRNA transcript j) was 
calculated to assess the primary miRNA abundance within that 
IncRNA locus as is given by the following formula: 
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Figure 8. Workflow diagram of LncRBase. LncRBase is compiled by subsequent importing, naming, analysis and visualization of IncRNA 
transcripts. Every IncRNA transcript is subsequently characterised using multiple parameters, and the results are stored in the database. A web- 
interface built using Perl enables IncRNA visualization and database querying. 
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Significance Score, 

^ — \L'luslers within Irtmsaipl 

y ._ l small RNA read counl (within each small RNA cluster) 

IncRNA transcript Jength 

Significance Score, represents the abundance of piRNAs or 
primary miRNAs within a particular IncRNA transcript. 

E. Remapping of microarray probes to 
IncRNAs. Affymetrix GeneChip Human Genome U133 Plus 
2.0 Array and GeneChip Mouse Genome 430 2.0 Array probe 
sequenes were aligned to IncRNA sequences using BLAT with 
default parameters. Alignment results were filtered by a criteria of 
block count = 1, mismatch = 0, match size = query size, strand = 
negative. Filtered probes were selected to be mapped to IncRNAs. 

F. RNA-Seq expression study. For RNA-Seq data, all 
sequenced reads from each tissue type were aligned to the human 
and mouse reference genome using the spliced read aligner 
TopHat2 [5 7] . Transcript assembly and abundance estimation of 
each tissue type was performed using Cufflinks [58] . 

Implementation 

All computational programs for the collection, sorting and 
redundancy removal of the data and the genome mapping of 
putative IncRNA transcripts to exons, introns, miRNA primary 
transcripts, piRNA, Imprinted genes, CGIs were executed using 
custom Perl scripts and UNIX shell scripting languages. 

LncRBase has been developed as a relational database using 
MySQL. This web server runs in a Linux environment. The 



search engine is powered by apache http deamon. The interface 
layer has been designed using HTML/CSS and the database is 
connected to the web interface using perl CGI module. Server side 
Perl scripts are implied to connect and query LncRBase using perl 
DBI module, and to generate dynamic HTML pages to produce 
output [Figure 8]. LncRBrowse has been implemented using 
JBrowse 1.9.8 [66,67], and the browse interface of LncRBase is 
connected to IncRBrowse using php script. 

Conclusion 

The discovery of several thousands of IncRNAs and current 
upsurge in IncRNA annotation and characterization has added 
another layer of complexity towards understanding mammalian 
genomes and transcriptomes. Given the ever increasing number of 
transcripts identified as IncRNAs, it has not yet been possible to 
precisely define the functional repertoire of these versatile 
transcripts. A combination of in silico and laboratory-based 
approaches is needed to analyze IncRNA biogenesis and their 
various functional intricacies. Our contribution to the rapidly 
expanding field of 'IncRNomics' has been directed towards 
constructing a well collated IncRNA catalog incorporating our 
findings and those of other published works in the form of a 
comprehensive database. We have classified IncRNAs based on 
their genomic position relative to known protein-coding genes. We 
have analyzed the association of IncRNAs with Repeat Elements, 
CGIs, Imprinted genes, small ncRNAs like primary miRNAs and 
piRNAs. Microarray probe sets have been remapped to IncRNAs 
and associated with different disease systems. LncRNA expression 
information has also been provided which will help towards 
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understanding tissue-specific behaviour of these multifunctional 
transcripts. 

LncRBase will serve as an enriched resource for IncRNAs with 
respect to data and information content. Six important features 
are key points of LncRBase: (i) elucidating non coding transcript 
variants of protein coding genes, (ii) usage of a unique identifier for 
each IncRNA transcript, (iii) analysis of IncRNA promoter regions, 
(iv) association of IncRNA transcripts with primary miRNAs and 
piRNAs (v) association of IncRNA transcripts with Imprinted 
genes and (vi) association of IncRNA transcripts with Repeat 
Elements. These, along with other detailed information available 
are expected to make LncRBase a useful resource for IncRNA 
research in human and mouse systems. 

LncRBase integrates information of varied content starting from 
basic sequence information, extending to categorization based on 
genomic context, coding potential score, re-annotated microarray 
probes, associated disease information and IncRNA expression in 
different tissues in human and mouse. LncRBase is designed to 
enable integration with other resources, including the UCSC 
Genome Browser, Ensembl, NONCODE v3.0 and other data- 
bases, thus providing an integrated repository for IncRNAs. 

With the advances in next generation sequencing technology, 
more IncRNA genes are expected to be discovered. LncRBase will 
incorporate these newly annotated IncRNA sequences to update 
existing information. We plan to incorporate structure based 
classification information on the IncRNA transcripts. LncRBase 
has the potential to become a community resource for IncRNA 
transcript information and annotation. 

Availability 

LncRBase is freely available at http:/ /bicresources.jcbose.ac. 
in/zhumur/lncrbase/. The LncRBase data files can be freely 
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